INTERSPEECH.2022 - Speech Processing

Total: 175

#1 Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion

Authors: Tuan Vu Ho ; Maori Kobayashi ; Masato Akagi

In most practical scenarios, an announcement system must deliver speech messages in a noisy environment in which the background noise cannot be cancelled out. The local noise reduces speech intelligibility and increases the listener's listening effort, hence hampering the effectiveness of the announcement system. It has been reported that the voices of professional announcers are clearer and more comprehensible than those of non-expert speakers in noisy environments. This finding suggests that speech intelligibility might be related to the speaking style of professional announcers, which can be adopted using voice conversion. Motivated by this idea, this paper proposes a speech intelligibility enhancement for noisy environments that applies voice conversion to non-professional voices. We discovered that professional announcers and non-professional speakers fall into different clusters in the speaker embedding space. This implies that speech intelligibility can be controlled as an independent feature of speaker individuality. To examine the advantage of the converted voice in noisy environments, we conducted experiments using test words masked in pink noise at different SNR levels. The results of objective and subjective evaluations confirm that the speech intelligibility of the converted voice is higher than that of the original voice in low-SNR conditions.

#2 Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement

Authors: Tuan Vu Ho ; Quoc Huy Nguyen ; Masato Akagi ; Masashi Unoki

Recent speech enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter. In this method, a noise-robust vector-quantized variational autoencoder is utilized to estimate the magnitude of the Wiener filter using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated by a convolutional recurrent network using a scale-invariant signal-to-noise ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech enhancement studies and achieved a PESQ score of 2.85 and a STOI score of 0.94, outperforming the state-of-the-art method based on cIRM estimation in the 2020 Deep Noise Suppression Challenge.
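For readers unfamiliar with the two training objectives mentioned above, the hedged sketch below shows how an Itakura-Saito divergence on magnitude/power spectra and a negative SI-SNR loss on time-domain waveforms are commonly written in PyTorch. This is an illustrative sketch with placeholder tensor shapes, not the authors' implementation.

```python
import torch

def itakura_saito_loss(est_power, ref_power, eps=1e-8):
    """Itakura-Saito divergence between estimated and reference power spectra.
    Both inputs are non-negative tensors of shape (batch, freq, frames)."""
    ratio = (ref_power + eps) / (est_power + eps)
    return (ratio - torch.log(ratio) - 1.0).mean()

def neg_si_snr_loss(est_wave, ref_wave, eps=1e-8):
    """Negative scale-invariant SNR between time-domain signals, shape (batch, samples)."""
    est_wave = est_wave - est_wave.mean(dim=-1, keepdim=True)
    ref_wave = ref_wave - ref_wave.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference to get the "target" component
    proj = (est_wave * ref_wave).sum(-1, keepdim=True) * ref_wave \
           / (ref_wave.pow(2).sum(-1, keepdim=True) + eps)
    noise = est_wave - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()
```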

#3 iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement

Authors: Minseung Kim ; Hyungchan Song ; Sein Cheong ; Jong Won Shin

Deep learning approaches have been successfully applied to single-channel speech enhancement, exhibiting significant performance improvements. Recently, approaches unifying deep learning techniques with a statistical speech enhancement framework have been proposed, including Deep Xi and DeepMMSE, in which a priori signal-to-noise ratios (SNRs) are estimated by deep neural networks (DNNs) and the noise power spectral density (PSD) and spectral gain functions are computed from the estimated parameters. In this paper, we propose an improved DeepMMSE (iDeepMMSE), which estimates the speech PSD and speech presence probability as well as the a priori SNR using a DNN for MMSE estimation of the speech and noise PSDs. The a priori and a posteriori SNRs are refined with the estimated PSDs, which in turn are used to compute the spectral gain function. We also replace the DNN architecture with the Conformer, which efficiently captures local and global sequential information. Experimental results on the Voice Bank-DEMAND dataset and the Deep Xi dataset show that the proposed iDeepMMSE outperforms DeepMMSE in terms of perceptual evaluation of speech quality (PESQ) scores and composite objective measures.
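As a reference for the classical quantities the abstract relies on, the sketch below computes the a priori SNR, the a posteriori SNR, and a Wiener-type spectral gain from estimated speech and noise PSDs. It uses the simple Wiener gain G = xi/(1+xi) for illustration rather than the paper's exact MMSE estimator.

```python
import numpy as np

def wiener_gain_from_psds(noisy_power, speech_psd, noise_psd, eps=1e-10):
    """Given the noisy power spectrum and estimated speech/noise PSDs
    (all shaped freq x frames), return the a priori SNR xi, the a posteriori
    SNR gamma, and the Wiener-type gain G = xi / (1 + xi)."""
    xi = speech_psd / (noise_psd + eps)       # a priori SNR
    gamma = noisy_power / (noise_psd + eps)   # a posteriori SNR
    gain = xi / (1.0 + xi)                    # Wiener gain
    return xi, gamma, gain

# usage: enhanced_magnitude = gain * noisy_magnitude
```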

#4 Boosting Self-Supervised Embeddings for Speech Enhancement

Authors: Kuo-Hsuan Hung ; Szu-wei Fu ; Huan-Hsin Tseng ; Hsin-Tien Chiang ; Yu Tsao ; Chii-Wann Lin

Self-supervised learning (SSL) representations for speech have achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement on speech enhancement (SE) tasks. In this study, we use a cross-domain feature to address the problem that SSL embeddings may lack the fine-grained information needed to regenerate speech signals. By integrating the SSL representation and the spectrogram, the results can be significantly boosted. We further study the relationship between the noise robustness of SSL representations, measured via the clean-noisy distance (CN distance), and layer importance for SE. We find that SSL representations with lower noise robustness are more important. Furthermore, our experiments on the VCTK-DEMAND dataset demonstrate that fine-tuning an SSL representation together with an SE model can outperform SOTA SSL-based SE methods in PESQ, CSIG, and COVL without invoking complicated network architectures. In later experiments, the CN distance of the SSL embeddings was observed to increase after fine-tuning. These results verify our expectations and may help in designing SE-related SSL training in the future.
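The sketch below illustrates one plausible way to compute a per-layer clean-noisy (CN) distance: the average frame-wise L2 distance between the embeddings a frozen SSL model produces for the clean and noisy versions of the same utterance. The `ssl_model` interface (returning one hidden state per layer) is a hypothetical assumption, and the paper's exact definition may differ.

```python
import torch

def clean_noisy_distance(ssl_model, clean_wave, noisy_wave):
    """Per-layer CN distance between SSL embeddings of clean and noisy audio.
    Assumes `ssl_model(wave)` returns a list of hidden states, one per layer,
    each of shape (batch, frames, dim) -- a hypothetical interface."""
    with torch.no_grad():
        clean_layers = ssl_model(clean_wave)
        noisy_layers = ssl_model(noisy_wave)
    return [torch.norm(c - n, dim=-1).mean().item()
            for c, n in zip(clean_layers, noisy_layers)]
```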

#5 Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections

Authors: Seorim Hwang ; Sung Wook Park ; Youngcheol Park

Capturing contextual information at multiple scales is known to be beneficial for improving the performance of DNN-based speech enhancement (SE) models. This paper proposes a new SE model, called NUNet-TLS, with two-level skip connections between the residual U-Blocks nested in each layer of a large U-Net structure. The proposed model also has causal time-frequency attention (CTFA) at the output of the residual U-Block to boost the dynamic representation of the speech context at multiple scales. Even with the two-level skip connections, the proposed model only slightly increases the number of network parameters, while the performance improvement is significant. Experimental results show that the proposed NUNet-TLS outperforms other state-of-the-art models on various objective evaluation metrics. The code of our model is available at https://github.com/seorim0/NUNet-TLS

#6 CycleGAN-based Unpaired Speech Dereverberation

Authors: Hannah Muckenhirn ; Aleksandr Safin ; Hakan Erdogan ; Felix de Chaumont Quitry ; Marco Tagliasacchi ; Scott Wisdom ; John R. Hershey

Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained on large amounts of data and a variety of room impulse responses when the data is synthetically reverberated, since acquiring real paired data is costly. In this paper we propose a CycleGAN-based approach that enables dereverberation models to be trained on unpaired data. We quantify the impact of using unpaired data by comparing the proposed unpaired model to a paired model with the same architecture and trained on the paired version of the same dataset. We show that the performance of the unpaired model is comparable to the performance of the paired model on two different datasets, according to objective evaluation metrics. Furthermore, we run two subjective evaluations and show that both models achieve comparable subjective quality on the AMI dataset, which was not seen during training.
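To make the unpaired training idea concrete, the sketch below shows the cycle-consistency part of a CycleGAN-style setup for dereverberation: two generators map reverberant audio to dry audio and back, and each mixture must be reconstructed after a round trip. The generator callables and the L1 reconstruction terms are illustrative assumptions; adversarial losses from the discriminators would be added separately, and the authors' exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_losses(G_dereverb, G_reverb, dry_batch, rev_batch):
    """Cycle losses for unpaired dereverberation:
    reverberant -> dry -> reverberant and dry -> reverberant -> dry
    should both reconstruct their inputs. G_dereverb and G_reverb are
    hypothetical generators mapping waveforms to waveforms."""
    rec_rev = G_reverb(G_dereverb(rev_batch))
    rec_dry = G_dereverb(G_reverb(dry_batch))
    return F.l1_loss(rec_rev, rev_batch) + F.l1_loss(rec_dry, dry_batch)
```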

#7 Attentive Training: A New Training Framework for Talker-independent Speaker Extraction

Authors: Ashutosh Pandey ; DeLiang Wang

Listening in a multitalker scenario, we typically attend to a single talker through auditory selective attention. Inspired by human selective attention, we propose attentive training: a new training framework for talker-independent speaker extraction with an intrinsic selection mechanism. In the real world, multiple talkers are very unlikely to start speaking at the same time. Based on this observation, we train a deep neural network to create a representation for the first speaker and utilize it to extract or track that speaker from a multitalker noisy mixture. Experimental results demonstrate the superiority of attentive training over the widely used permutation invariant training for talker-independent speaker extraction, especially in conditions mismatched in the number of speakers, speaker interaction patterns, and the amount of speaker overlap.

#8 Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement

Authors: Tyler Vuong ; Richard Stern

We describe an improved modulation-domain loss for deep-learning-based speech enhancement (SE) systems. We utilize a simple self-supervised speech reconstruction task to learn a set of spectro-temporal receptive fields (STRFs). Similar to the recently developed spectro-temporal modulation error, the learned STRFs are used to calculate a weighted mean-squared error in the modulation domain for training a speech enhancement system. Experiments show that training SE systems with the improved modulation-domain loss consistently improves objective predictions of speech quality and intelligibility. Additionally, we show that the SE systems improve the word error rate of a state-of-the-art automatic speech recognition system at low SNRs.
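The sketch below illustrates the general shape of such a modulation-domain loss: filter the enhanced and reference spectrograms with a bank of STRF kernels (implemented as 2-D convolutions) and take a weighted MSE between the resulting modulation representations. Kernel shapes and the weighting scheme are placeholders, not the paper's learned values.

```python
import torch
import torch.nn.functional as F

def modulation_domain_loss(strf_kernels, est_spec, ref_spec, weights=None):
    """Weighted MSE in the modulation domain.
    est_spec / ref_spec: (batch, 1, freq, time) log-spectrograms.
    strf_kernels: (n_filters, 1, k_freq, k_time) learned STRF filters."""
    est_mod = F.conv2d(est_spec, strf_kernels, padding="same")
    ref_mod = F.conv2d(ref_spec, strf_kernels, padding="same")
    err = (est_mod - ref_mod) ** 2
    if weights is not None:                      # per-filter weights, shape (n_filters,)
        err = err * weights.view(1, -1, 1, 1)
    return err.mean()
```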

#9 Perceptual Characteristics Based Multi-objective Model for Speech Enhancement

Authors: Chiang-Jen Peng ; Yun-Ju Chan ; Yih-Liang Shen ; Cheng Yu ; Yu Tsao ; Tai-Shih Chi

Deep learning has been widely adopted for speech applications. Many studies have shown that using a multi-objective framework and learned deep features is effective for improving system performance. In this paper, we propose a perceptual-characteristics-based multi-objective speech enhancement (SE) algorithm that combines the conventional SE loss with objective losses on pitch- and timbre-related features. These features include frequency modulation (encoded by the pitch contour), amplitude modulation (encoded by the energy contour), and speaker identity. For the speaker identity loss, we use the deep features derived from a speaker identification system. The proposed algorithm consists of two parts: an LSTM-based SE model and CNN-based multi-objective models. The objective losses are computed between the speech enhanced by the SE model and clean speech, and combined with the SE loss for updating the SE model. The proposed algorithm is evaluated on the corpus of the Taiwan Mandarin hearing in noise test (TMHINT). Experimental results show that the proposed algorithm clearly outperforms the original SE model in all objective scores, including speech quality, speech intelligibility, and signal distortion.
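The sketch below shows the general form of such a multi-objective loss: the usual spectral MSE plus auxiliary terms comparing perceptual features (pitch contour, energy contour, speaker embedding) extracted from the enhanced and clean signals. The three feature extractors and the loss weights are hypothetical placeholders, not the paper's models.

```python
import torch.nn.functional as F

def multi_objective_se_loss(enh, clean, pitch_net, energy_net, spk_net,
                            w_pitch=0.1, w_energy=0.1, w_spk=0.1):
    """Conventional SE loss plus perceptual-feature losses (illustrative).
    pitch_net, energy_net, spk_net are assumed frozen feature extractors."""
    loss = F.mse_loss(enh, clean)
    loss = loss + w_pitch  * F.mse_loss(pitch_net(enh),  pitch_net(clean))
    loss = loss + w_energy * F.mse_loss(energy_net(enh), energy_net(clean))
    loss = loss + w_spk    * F.mse_loss(spk_net(enh),    spk_net(clean))
    return loss
```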

#10 Listen only to me! How well can target speech extraction handle false alarms?

Authors: Marc Delcroix ; Keisuke Kinoshita ; Tsubasa Ochiai ; Katerina Zmolikova ; Hiroshi Sato ; Tomohiro Nakatani

Target speech extraction (TSE) extracts the speech of a target speaker from a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures where the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, i.e., an inactive speaker (IS). A typical TSE system will tend to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims at better understanding how well TSE systems can handle IS cases. We consider two approaches for dealing with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset and reveal their pros and cons.

#11 Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction

Authors: Hao Shi ; Longbiao Wang ; Sheng Li ; Jianwu Dang ; Tatsuya Kawahara

Many state-of-the-art speech enhancement (SE) systems have recently used convolutional neural networks (CNNs) to extract multi-scale feature maps. However, CNNs rely more on local texture than on global shape, which makes them more susceptible to degraded spectrograms and may cause them to miss the detailed structure of speech. Although some two-stage systems feed the first-stage enhanced and original noisy spectrograms to the second stage simultaneously, this does not guarantee sufficient guidance for the second stage, since the first-stage spectrogram cannot provide precise spectral details. In order to allow CNNs to perceive clear boundary information of speech components, we compose feature maps from spectrograms containing evident speech components according to the mask values from the first stage: the positions where the mask exceeds certain thresholds are extracted as feature maps. These feature maps make the boundary information of speech components obvious by ignoring the remaining positions, thus making the CNNs sensitive to the input features. Experiments on the VB dataset show that, with a proper number of decompositions, the proposed method can enhance SE performance, providing a 0.15 PESQ improvement. In addition, the proposed method is more effective for spectral detail recovery.
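A minimal sketch of the thresholded decomposition described above is given below: for each threshold, only the time-frequency bins whose first-stage mask value exceeds it are kept, and the resulting spectrograms are stacked as input channels for the second stage. The threshold values are placeholders, not the paper's settings.

```python
import numpy as np

def decompose_by_mask(noisy_spec, mask, thresholds=(0.3, 0.5, 0.7)):
    """Decompose a noisy magnitude spectrogram using the first-stage mask.
    noisy_spec, mask: arrays of shape (freq, time) with mask values in [0, 1].
    Returns an array of shape (n_thresholds, freq, time)."""
    feature_maps = [np.where(mask > t, noisy_spec, 0.0) for t in thresholds]
    return np.stack(feature_maps, axis=0)
```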

#12 Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments

Authors: Jean-Marie Lemercier ; Joachim Thiemann ; Raphael Koning ; Timo Gerkmann

In this paper, a neural network-augmented algorithm for noise-robust online dereverberation with a Kalman filtering variant of the weighted prediction error (WPE) method is proposed. The filter stochastic variations are predicted by a deep neural network (DNN) trained end-to-end using the filter residual error and signal characteristics. The presented framework allows for robust dereverberation on a single-channel noisy reverberant dataset similar to WHAMR!. The Kalman filtering WPE introduces distortions in the enhanced signal when predicting the filter variations from the residual error only, if the target speech power spectral density is not perfectly known and the observation is noisy. The proposed approach avoids these distortions by correcting the filter variations estimation in a data-driven way, increasing the robustness of the method to noisy scenarios. Furthermore, it yields a strong dereverberation and denoising performance compared to a DNN-supported recursive least squares variant of WPE, especially for highly noisy inputs.

#13 PodcastMix: A dataset for separating music and speech in podcasts

Authors: Nicolás Schmidt ; Jordi Pons ; Marius Miron

We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. We aim at defining a benchmark suitable for training and evaluating (deep learning) source separation models. To that end, we release a large and diverse training dataset based on programmatically generated podcasts. However, current (deep learning) models can run into generalization issues, especially when trained on synthetic data. To target potential generalization issues, we release an evaluation set based on real podcasts, for which we design objective and subjective tests. In our experiments with real podcasts, we find that current (deep learning) models may have generalization issues. Yet, they can perform competently; e.g., our best baseline separates speech with a mean opinion score of 3.84 (rating "overall separation quality" from 1 to 5). The dataset and baselines are accessible online.

#14 Independence-based Joint Dereverberation and Separation with Neural Source Model

Authors: Kohei Saijo ; Robin Scheibler

We propose an independence-based joint dereverberation and separation method with a neural source model. We introduce a neural network into the framework of time-decorrelation iterative source steering, an extension of independent vector analysis to joint dereverberation and separation. The network is trained end-to-end with a permutation-invariant loss on the time-domain separated output signals. Our proposed method can be applied in any situation with at least as many microphones as sources, regardless of their number. In experiments, we demonstrate that our method achieves high performance in terms of both speech quality metrics and word error rate (WER), even for mixtures with a different number of speakers than seen during training. Furthermore, the model trained on synthetic mixtures, without any modifications, greatly reduces the WER on the recorded LibriCSS dataset.

#15 Spatial Loss for Unsupervised Multi-channel Source Separation

Authors: Kohei Saijo ; Robin Scheibler

We propose a spatial loss for unsupervised multi-channel source separation. The proposed loss exploits the duality of direction of arrival (DOA) estimation and beamforming: the steering and beamforming vectors should be aligned for the target source but orthogonal for interfering ones. The spatial loss encourages consistency between the mixing and demixing systems, obtained from a classic DOA estimator and a neural separator, respectively. With the proposed loss, we train neural separators based on minimum variance distortionless response (MVDR) beamforming and independent vector analysis (IVA). We also investigate the effectiveness of combining our spatial loss with a signal loss that uses the outputs of blind source separation as references. We evaluate the proposed method on synthetic and recorded (LibriCSS) mixtures. We find that the spatial loss is most effective for training IVA-based separators. For the neural MVDR beamformer, it performs best when combined with a signal loss. On synthetic mixtures, the proposed unsupervised loss leads to the same performance as a supervised loss in terms of word error rate. On LibriCSS, we obtain close to state-of-the-art performance without any labeled training data.
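The sketch below illustrates the alignment/orthogonality idea behind such a spatial loss: each beamforming vector should have high correlation with its own source's steering vector and low correlation with the others'. The normalization and the on-/off-diagonal weighting are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch

def spatial_loss(bf_vectors, steering_vectors):
    """Illustrative spatial loss exploiting DOA/beamforming duality.
    bf_vectors: complex beamforming vectors from the neural separator, (n_src, n_mics).
    steering_vectors: complex steering vectors from a classic DOA estimator, (n_src, n_mics)."""
    w = bf_vectors / bf_vectors.norm(dim=-1, keepdim=True)
    a = steering_vectors / steering_vectors.norm(dim=-1, keepdim=True)
    align = torch.abs(w.conj() @ a.transpose(0, 1))   # |w_i^H a_j| for all pairs (i, j)
    on_diag = torch.diagonal(align).mean()            # reward alignment with own source
    n_off = max(align.numel() - align.shape[0], 1)
    off_diag = (align.sum() - torch.diagonal(align).sum()) / n_off  # penalize leakage
    return (1.0 - on_diag) + off_diag
```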

#16 Effect of Head Orientation on Speech Directivity

Authors: Samuel Bellows ; Timothy W. Leishman

The directional characteristics of human speech have many applications in speech acoustics, audio, telecommunications, room acoustical design, and other areas. However, professionals in these fields require carefully conducted, high-resolution, spherical speech directivity measurements taken under distinct circumstances to gain additional insights for their work. Because head orientation and human-body diffraction influence speech radiation, this work explores such effects under various controlled conditions through the changing directivity patterns of a head and torso simulator. The results show that head orientation and body diffraction at low frequencies impact directivities only slightly. However, the effects are more substantial at higher frequencies, particularly above 1 kHz.

#17 Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals

Authors: Kohei Saijo ; Tetsuji Ogawa

We present an unsupervised training method for the sequential neural beamformer (Seq-BF) that uses coarsely separated and non-separated supervisory signals. Signals coarsely separated by blind source separation (BSS) have been used to train neural separators in an unsupervised manner, but the achievable performance is limited by distortions in the supervision. In contrast, remix-cycle-consistent learning (RCCL) enables a separator to be trained on distortion-free observed mixtures by making the remixed mixtures, obtained by repeatedly separating and remixing two different mixtures, closer to the original mixtures. Still, training with RCCL from scratch often falls into a trivial solution, i.e., not separating the signals. The present study provides a novel unsupervised learning algorithm for the Seq-BF with two stacked neural separators, in which the separators are pre-trained using the BSS outputs and then fine-tuned with RCCL. This configuration compensates for the shortcomings of both approaches: the guiding mechanism in the Seq-BF accelerates separation beyond BSS performance, thereby stabilizing RCCL. Experimental comparisons demonstrate that the proposed unsupervised learning achieves performance comparable to supervised learning (a 0.4-point difference in word error rate).
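To make the remix-cycle idea more tangible, the sketch below shows one plausible form of a remix-cycle-consistent loss: separate two mixtures, cross-remix the estimated sources, separate the remixes, remix back, and require the result to match the original mixtures. The separator interface and the specific remixing pattern are assumptions for illustration; the paper's exact scheme may differ.

```python
import torch.nn.functional as F

def remix_cycle_consistency_loss(separator, mix_a, mix_b):
    """Illustrative RCCL for two-speaker mixtures.
    Assumes separator maps (batch, samples) -> (batch, n_src, samples)."""
    src_a = separator(mix_a)
    src_b = separator(mix_b)
    remix_1 = src_a[:, 0] + src_b[:, 1]        # cross-remixed pseudo-mixtures
    remix_2 = src_b[:, 0] + src_a[:, 1]
    src_1 = separator(remix_1)
    src_2 = separator(remix_2)
    rec_a = src_1[:, 0] + src_2[:, 1]          # remix back toward the originals
    rec_b = src_2[:, 0] + src_1[:, 1]
    return F.l1_loss(rec_a, mix_a) + F.l1_loss(rec_b, mix_b)
```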

#18 Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language

Authors: Marvin Borsdorf ; Kevin Scheck ; Haizhou Li ; Tanja Schultz

We introduce blind language separation (BLS) as a novel research task in which we seek to disentangle overlapping voices of multiple languages by language. BLS is expected to separate seen as well as unseen languages, which differs from the target language extraction task that works for one seen target language at a time. To develop a BLS model, we simulate a multilingual cocktail party database in which each scene consists of two randomly selected languages, each represented by two randomly selected speakers. The database follows the recently proposed GlobalPhoneMCP database design concept, which uses the audio data of the GlobalPhone 2000 Speaker Package. We show that a BLS model is able to learn language characteristics so as to disentangle overlapping voices by language. We achieve a mean SI-SDR improvement of 12.63 dB over 231 test sets. The performance on the individual test sets varies depending on the language combination. Finally, we show that BLS can generalize well to unseen speakers and languages in the mixture.

#19 NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain

Authors: Mateusz Guzik ; Konrad Kowalczyk

This paper presents a novel Non-negative Tensor Factorization (NTF) based approach to tracking and separation of moving sound sources, formulated in the Spherical Harmonic Domain (SHD). In particular, we first redefine an existing Ambisonic NTF by introducing time-dependence into the Spatial Covariance Matrix (SCM) model. Next, we further extend the time-dependent SCM by incorporating a newly proposed NTF model of the spatial features, thereby introducing spatial components. To exploit the relationship between the positions of sound sources in adjacent time frames, which results from the naturally occurring continuity of the movement itself, we impose local smoothness on the time-dependent components of the spatial features. To this end, we propose a suitable posterior probability with a Gibbs prior and derive the corresponding update rules. The experimental evaluation is based on first-order Ambisonic recordings of speech utterances and musical instruments in several scenarios with moving sources.

#20 Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation

Authors: Jack Deadman ; Jon Barker

Simulation plays a crucial role in developing components of automatic speech recognition systems such as enhancement and diarization. In source separation and target-speaker extraction, datasets with high degrees of temporal overlap are used both in training and evaluation. However, this contrasts with the fact that people tend to avoid such overlap in real conversations. It is well known that artifacts introduced by pre-processing when there is no overlapping speech can be detrimental to recognition performance. This work proposes a finite-state generative method trained on timing information from speech corpora, which leads to two main contributions. First, a method for generating arbitrarily large datasets that follow the desired statistics of real parties. Second, features extracted from the models are shown to correlate with speaker extraction performance. This allows us to quantify how much of the difficulty of a mixture is due to turn-taking, factoring out other complexities in the signal. Models that treat speakers as independent produce poor generation and representation results. We improve upon this by proposing models whose states are conditioned on whether another person is speaking.
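The toy generator below illustrates the flavor of a finite-state turn-taking model whose transitions are conditioned on whether the other person is speaking, which discourages long overlaps. The two-speaker structure, frame-level states, and probability values are placeholders, not the paper's trained model.

```python
import numpy as np

def sample_turn_taking(n_frames, p_start=0.02, p_stop=0.05, p_overlap=0.3, seed=0):
    """Sample a (2, n_frames) binary speaker-activity matrix from a simple
    two-speaker Markov turn-taking model. A silent speaker is less likely to
    start talking while the other speaker is active (scaled by p_overlap)."""
    rng = np.random.default_rng(seed)
    active = np.zeros((2, n_frames), dtype=int)
    state = [0, 0]                              # 0 = silent, 1 = speaking
    for t in range(n_frames):
        for spk in (0, 1):
            other_on = state[1 - spk] == 1
            if state[spk] == 0:
                p = p_start * (p_overlap if other_on else 1.0)
                state[spk] = int(rng.random() < p)
            else:
                state[spk] = int(rng.random() >= p_stop)
            active[spk, t] = state[spk]
    return active
```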

#21 An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Authors: Christoph Boeddeker ; Tobias Cord-Landwehr ; Thilo von Neumann ; Reinhold Haeb-Umbach

Spatial mixture model (SMM) supported acoustic beamforming has been extensively used for the separation of simultaneously active speakers. However, it has hardly been considered for the separation of meeting data, which are characterized by long recordings and only partially overlapping speech. In this contribution, we show that the fact that often only a single speaker is active can be utilized for a clever initialization of an SMM that employs time-varying class priors. In experiments on LibriCSS, we show that the proposed initialization scheme achieves a significantly lower Word Error Rate (WER) on a downstream speech recognition task than a random initialization of the class probabilities drawn from a Dirichlet distribution. With the only requirement being that the number of speakers has to be known, we obtain a WER of 5.9%, which is comparable to the best reported WER on this dataset. Furthermore, the estimated speaker activity from the mixture model serves as a diarization based on spatial information.

#22 Prototypical speaker-interference loss for target voice separation using non-parallel audio samples

Authors: Seongkyu Mun ; Dhananjaya Gowda ; Jihwan Lee ; Changwoo Han ; Dokyun Lee ; Chanwoo Kim

In this paper, we propose a new prototypical loss function for training neural network models for target voice separation. Conventional methods use paired parallel audio samples of the target speaker with and without an interfering speaker or noise, and minimize the spectrographic mean squared error (MSE) between the clean and enhanced target speaker audio. Motivated by the use of contrastive loss in speaker recognition tasks, we earlier proposed a speaker representation loss that uses representative samples from the target speaker in addition to the conventional MSE loss. In this work, we propose a prototypical speaker-interference (PSI) loss that makes use of representative samples from the target speaker, the interfering speaker, and the interfering noise to better utilize any non-parallel data that may be available. The performance of the proposed loss function is evaluated using VoiceFilter, a popular framework for target voice separation. Experimental results show that the proposed PSI loss significantly improves the PESQ scores of the enhanced target speaker audio.
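The sketch below shows one plausible shape for such a prototypical loss: the usual spectral MSE plus a term that pulls the embedding of the enhanced audio toward the target-speaker prototype (a mean embedding of its non-parallel samples) and pushes it away from interfering-speaker and noise prototypes. The embedding network, the cosine-based form, and the weighting are assumptions; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def psi_loss(embed_net, enhanced, clean, target_protos, interf_protos, noise_protos,
             alpha=0.1):
    """Illustrative prototypical speaker-interference loss.
    embed_net maps audio/spectrogram to (batch, dim) embeddings; the three
    prototype tensors are (batch, dim) mean embeddings from non-parallel data."""
    mse = F.mse_loss(enhanced, clean)
    e = F.normalize(embed_net(enhanced), dim=-1)
    pos = F.cosine_similarity(e, F.normalize(target_protos, dim=-1), dim=-1)
    neg_spk = F.cosine_similarity(e, F.normalize(interf_protos, dim=-1), dim=-1)
    neg_noise = F.cosine_similarity(e, F.normalize(noise_protos, dim=-1), dim=-1)
    proto_term = (1.0 - pos + F.relu(neg_spk) + F.relu(neg_noise)).mean()
    return mse + alpha * proto_term
```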

#23 Relationship between the acoustic time intervals and tongue movements of German diphthongs

Authors: Arne-Lukas Fietkau ; Simon Stone ; Peter Birkholz

This study investigated the relationship between tongue movements during the production of German diphthongs and their acoustic time intervals. To this end, five subjects produced a set of logatomes that contained German primary, secondary, and peripheral diphthongs in the context of bilabial and labiodental consonants at three different speaking rates. During the utterances, tongue movements were measured by means of optical palatography (OPG), i.e. by optical distance sensing in the oral cavity, along with the acoustic speech signal. The analysis of the movement signals revealed that the diphthongs have s-shaped tongue trajectories that strongly resemble half cosine periods. In addition, acoustic and articulatory diphthong durations have a linear, but not proportional, relationship. Finally, the peak velocity and midpoint between the two targets of a diphthong are reached in the middle of both the acoustic and articulatory diphthong time intervals, regardless of the duration and type of diphthong. These results can help to model realistic tongue movements for diphthongs in articulatory speech synthesis.

#24 Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children

Authors: Sanae Matsui ; Kyoji Iwamoto ; Reiko Mazuka

The development of allophonic variants of phonemes is poorly understood. This study aimed to examine when the allophonic variants of a phoneme come to be realized as they are by adults. Japanese children aged 5–13 years and adults participated in an elicited production task. We analyzed developmental changes in the allophonic variation of the phoneme /z/, which is realized variably as either an affricate or a fricative. The results revealed that children aged nine years or younger realized /z/ as an affricate significantly more often than 13-year-old and adult speakers. Once children reached 11 years of age, the difference from adults was no longer statistically significant, which suggests a developmental pattern similar to that of speech motor control (e.g., lip and jaw) and cognitive-linguistic skill. Moreover, we examined whether the developmental changes in the allophonic realization of /z/ are due to speech rate and the time taken to articulate /z/. The results showed that, unlike in adults, children's allophonic realization of /z/ is not affected by these factors. We also found that the effects of speech rate and the time to articulate /z/ on the allophonic realization become adult-like at around 11 years of age.

#25 Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition

Authors: Chung-Soo Ahn ; Chamara Kasun ; Sunil Sivadas ; Jagath Rajapakse

To infer emotions accurately from speech, fusing audio and text is essential, as words carry most of the information about semantics and emotions. The attention mechanism is an essential component of multimodal fusion architectures, as it dynamically pairs different regions within multimodal sequences. However, existing architectures lack an explicit structure to model the dynamics between fused representations. We therefore propose recurrent multi-head attention in a fusion architecture, which selects salient fused representations and learns the dynamics between them. Multiple 2-D attention layers select salient pairs among all possible pairs of audio and text representations, which are combined with a fusion operation. Lastly, the multiple fused representations are fed into a recurrent unit to learn the dynamics between them. Our method outperforms existing approaches for fusing audio and text for speech emotion recognition and achieves state-of-the-art accuracy on the benchmark IEMOCAP dataset.
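The sketch below illustrates the general pattern of attention-based audio-text fusion followed by a recurrent unit, in the spirit of the abstract above. It uses standard cross-attention and a GRU as stand-ins; the layer sizes, the concatenation-based fusion, and the single attention block are simplifying assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AttentionFusionSketch(nn.Module):
    """Cross-modal attention pairs audio frames with text tokens; a GRU then
    models the dynamics of the fused sequence before emotion classification."""
    def __init__(self, dim=256, n_heads=4, n_classes=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, audio, text):
        # audio: (batch, frames, dim), text: (batch, tokens, dim)
        attended_text, _ = self.cross_attn(query=audio, key=text, value=text)
        fused = torch.cat([audio, attended_text], dim=-1)  # pair audio with salient text
        _, h = self.rnn(fused)                             # dynamics between fused frames
        return self.classifier(h[-1])                      # utterance-level emotion logits
```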